I hope I learn to use RStudio and GitHub! The course seems really good :)
My GitHub repository: https://github.com/SinaHulkkonen/IODS-project
Sina Hulkkonen 08/11/2018
I did all the DataCamp exercises for this week. I had huge problems with GitHub, but I hope it is working now. I think I learned a lot!
#getting the data
students2014<-read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
#inspecting the structure and dimensions
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
Students2014 is a dataset with 166 observations of 7 variables. It contains data on 166 students: their attitudes, their answers to questions about their learning, and the points they received.
library(GGally)
library(ggplot2)
# getting to know the data, stratified by gender
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The pink colour is for females and the blue colour for males. The figures show the distributions of the variables and how they correlate with each other, stratified by gender. For example, it seems that the majority of the variables are roughly normally distributed and that attitude and points have the strongest correlation with each other.
#I chose the variables attitude, stra and surf, based on the previous section
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#stra and surf were not statistically significantly associated with points, so they are removed from this model
my_model <- lm(points ~ attitude, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
A linear regression model was fitted to test whether the dependent variable points is linearly associated with the explanatory variables attitude, stra and surf.
R-squared tells us how much of the variation in the dependent variable the model explains. It is about 19% in both models, which means that they model this question practically equally well (or badly).
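To compare the two fits directly, the adjusted R-squared values can be pulled out of the model summaries; a minimal sketch (my own addition, reusing the models fitted above):
# extract adjusted R-squared from both fits for a side-by-side comparison
summary(lm(points ~ attitude + stra + surf, data = students2014))$adj.r.squared
summary(lm(points ~ attitude, data = students2014))$adj.r.squared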
# create a regression model
my_model2 <- lm(points ~ attitude, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))
In the Q-Q plot we see that the points lie quite close to the line, meaning that the errors are approximately normally distributed (as we assume in linear regression). In the Residuals vs Fitted values scatter plot we see that the points are spread all over the plot without forming a pattern, which means that the error variance does not depend on the explanatory variable (which is good!). In the Residuals vs Leverage plot we see that no observations stand out, so no single observation is pulling the regression line in some direction (causing error). That is also a good thing.
Sina Hulkkonen 17/11/2018
I did all the DataCamp exercises for this week. Even though I’m having some frustrating moments (I think we all are?), whenever I realize I have really learned something new, I’m delighted that I took this course. Below are the exercises for this week.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header = TRUE)
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382 35
These data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). There are 382 observations (students) of 35 variables.
I picked the variables age, sex, studytime and absences, since in epidemiology we usually adjust for at least age and sex, and I think alcohol consumption might affect study time and school absences. My hypotheses are:
1. The older the student, the more he/she consumes alcohol.
2. Boys consume more alcohol than girls.
3. Those consuming more alcohol study less.
4. Those consuming more alcohol have more school absences.
library(ggplot2)
library(tableone)
table1<-CreateTableOne(vars = c("age", "sex", "studytime", "absences"), factorVars = c("sex"),strata = c("high_use"), data = alc)
table1
## Stratified by high_use
## FALSE TRUE p test
## n 270 112
## age (mean (sd)) 16.51 (1.15) 16.78 (1.21) 0.041
## sex = M (%) 113 (41.9) 71 (63.4) <0.001
## studytime (mean (sd)) 2.15 (0.86) 1.76 (0.74) <0.001
## absences (mean (sd)) 4.23 (6.50) 7.96 (9.33) <0.001
g3 <- ggplot(alc, aes(x = high_use, y = age, col=sex))
g3 + geom_boxplot() + ylab("age")+ggtitle("Age by alcohol consumption and sex")
g2 <- ggplot(alc, aes(x = high_use, y = studytime, col=sex))
g2 + geom_boxplot() + ylab("studytime")+ggtitle("Study time by alcohol consumption and sex")
g1 <- ggplot(alc, aes(x = high_use, y = absences, col=sex))
g1 + geom_boxplot() + ylab("absences")+ggtitle("Student absences by alcohol consumption and sex")
Looking at these results, it seems that those who consume more alcohol:
1. are on average the same age as those who don’t, BUT among girls the high consumers tend to be younger and among boys older;
2. study less (but only among girls);
3. have more school absences, in both sexes.
m <- glm(high_use ~ age + sex + studytime + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + studytime + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6465 -0.8345 -0.5958 1.0875 2.1093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.38942 1.75909 -1.927 0.054004 .
## age 0.15415 0.10371 1.486 0.137182
## sexM 0.80051 0.25462 3.144 0.001667 **
## studytime -0.43984 0.16161 -2.722 0.006496 **
## absences 0.06621 0.01798 3.683 0.000231 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 415.46 on 377 degrees of freedom
## AIC: 425.46
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age sexM studytime absences
## -3.3894221 0.1541471 0.8005131 -0.4398373 0.0662127
OR <- coef(m) %>% exp
CI<-exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03372816 0.001026569 1.0312270
## age 1.16666248 0.953260213 1.4329365
## sexM 2.22668310 1.357240472 3.6904599
## studytime 0.64414120 0.464578524 0.8768517
## absences 1.06845395 1.033661557 1.1089056
In the logistic regression model, the odds ratios with 95% confidence intervals for the variables included in my study hypotheses are:
1. age: OR=1.17 (95% CI 0.95-1.43)
2. sex (male): OR=2.23 (95% CI 1.36-3.69)
3. study time: OR=0.64 (95% CI 0.46-0.88)
4. school absences: OR=1.07 (95% CI 1.03-1.11)
Interpretation: In this model, age does not predict high alcohol usage, since its 95% CI includes 1. Male sex does predict it, since its 95% CI doesn’t include 1; males are roughly twice as likely to be high consumers of alcohol. Study time had an OR under 1 with a 95% CI excluding 1, meaning that those who study more are less likely to consume a lot of alcohol. The OR for absences was a little over 1 with a 95% CI excluding 1, meaning that those who have more school absences are more likely to drink a lot of alcohol. If the variable absences were categorized differently (not on its full range of 0-93, but for example in two categories), the OR might be different (I assume larger).
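As a sketch of that last point, absences could be dichotomized and the model refitted; the median split below is my own arbitrary cutpoint, not part of the exercise:
# dichotomize absences at the median (hypothetical cutpoint) and refit the model
alc2 <- mutate(alc, absences_high = absences > median(absences))
m_alt <- glm(high_use ~ age + sex + studytime + absences_high, data = alc2, family = "binomial")
# odds ratios of the alternative model
exp(coef(m_alt))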
m <- glm(high_use ~ sex + studytime + absences, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.4.4
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 261 9
## TRUE 85 27
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g+geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$probability>0.5)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68324607 0.02356021 0.70680628
## TRUE 0.22251309 0.07068063 0.29319372
## Sum 0.90575916 0.09424084 1.00000000
#defining the function
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2460733
Interpretation: When the model predicts that a student is more likely than not to be a high consumer of alcohol, it finds 27 of the 85+27=112 students who really consume a high amount of alcohol. In other words, if I had to decide based on the three variables included in the model, these are the odds at which I’d succeed in guessing who consumes a lot of alcohol.
The total proportion of inaccurately classified individuals (= the training error) is 24.6%. Calculated manually this is (9+85)/(261+85+9+27), which gives the same result, as it should. I think the model did surprisingly well!
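The same confusion table also gives the sensitivity and specificity of the classifier; a small sketch (my own addition):
# sensitivity: share of true high users found; specificity: share of non-users correctly left out
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])    # 27/112, about 24%
conf["FALSE", "FALSE"] / sum(conf["FALSE", ]) # 261/270, about 97%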
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2460733
The 10-fold cross-validation prediction error of my model is 24.6% here; since the folds are sampled randomly, the estimate varies between runs (on an earlier run I got 26.2%). In the DataCamp exercise the error was 25.3%, so my model performs roughly on the same level.
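Because cv.glm() samples the folds randomly, the error estimate itself has spread; repeating the cross-validation a few times shows this (a sketch, my own addition; the seed is arbitrary):
# repeat 10-fold cross-validation to see how much the error estimate varies
set.seed(123)
cv_errors <- replicate(10, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
summary(cv_errors)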
I didn’t complete the extra bonus exercise.
Sina Hulkkonen 20/11/2018
I did all the DataCamp exercises for this week. This week’s exercises reminded me of the factor analysis I did for one article, and I’m getting excited as I realize I might have learned something new!
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Boston is a dataset with 506 observations of 14 variables. It describes the city of Boston, US. There’s information on the city’s economics, pollution and crime rates, for example. The variables are listed below:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked _by_ '.GlobalEnv':
##
## table1
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
The dots represent correlations. The more intense the colour and the bigger the dot, the stronger the correlation. Blue means positive and red negative correlation. For example, there is a strong positive correlation between rad and tax, meaning that the index of accessibility to radial highways tends to be higher where the full-value property-tax rate per $10,000 is greater.
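To back up the visual reading, the strongest pairwise correlations can be listed from the matrix itself; a sketch (my own addition):
# melt the correlation matrix, drop the diagonal and duplicate pairs, and show the five strongest correlations
cors <- as.data.frame(as.table(cor_matrix))
cors <- subset(cors, as.character(Var1) < as.character(Var2))
head(cors[order(-abs(cors$Freq)), ], 5)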
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2450495 0.2524752 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.8330360 -0.8701955 -0.194366687 -0.8537939 0.4288482
## med_low -0.0630164 -0.3253375 -0.073485621 -0.5956221 -0.1300593
## med_high -0.3846714 0.2090423 0.113661151 0.4173691 0.1140176
## high -0.4872402 1.0171096 -0.002135914 1.0421828 -0.3997478
## age dis rad tax ptratio
## low -0.8101430 0.8029435 -0.6907746 -0.7265757 -0.4075235
## med_low -0.3429872 0.4076802 -0.5387253 -0.4810545 -0.1026355
## med_high 0.3921240 -0.3878616 -0.4234010 -0.3145400 -0.3462679
## high 0.8471976 -0.8468372 1.6382099 1.5141140 0.7808718
## black lstat medv
## low 0.38548608 -0.76432793 0.506386863
## med_low 0.30319792 -0.13450913 0.006866648
## med_high 0.03741795 -0.01005295 0.187563041
## high -0.79586347 0.79980358 -0.619596402
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.05169673 0.65548206 -1.03029576
## indus 0.03647607 -0.22228005 0.26058612
## chas -0.08429132 -0.04740618 0.23214057
## nox 0.40203191 -0.86577260 -1.24465244
## rm -0.11704514 -0.11962944 -0.19872840
## age 0.18164808 -0.26921818 -0.00917447
## dis -0.05238622 -0.27247021 0.37427869
## rad 3.36557023 0.89307782 -0.05994593
## tax 0.05785314 0.02701338 0.46971185
## ptratio 0.11783453 0.04028700 -0.33028594
## black -0.11069388 0.04949834 0.11077218
## lstat 0.23249290 -0.23449950 0.56876901
## medv 0.22306107 -0.42192327 -0.05741324
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9565 0.0321 0.0114
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 10)
It looks like LD1 explains 95.7% of the between-class variance of the crime classes.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 9 0 0
## med_low 4 17 6 0
## med_high 0 11 11 2
## high 0 0 0 25
The LDA model didn’t predict everything correctly, but it did better with the “high” crime rate class than with the other classes (the classes are quartiles of the scaled crime rate). Of the 27 cases it predicted as “high”, 25 were correct, and it found all 25 true “high” cases. For the other classes it had more error: it classified 17 out of (17+9=) 26 “low” cases, 17 out of (4+17+6=) 27 “med_low” cases and 11 out of (11+11+2=) 24 “med_high” cases correctly.
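The overall accuracy on the test set can be computed from the same cross-tabulation; a sketch (my own addition), giving about 69% here:
# proportion of test observations on the diagonal of the confusion table
conf_lda <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf_lda)) / sum(conf_lda)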
# access the MASS package
library(MASS)
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#scale
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
#calculating the distances
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#searching for the right number of clusters
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
library(ggplot2)
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
As we see in the qplot, the line drops most radically when the number of clusters is 2, so two clusters were chosen. In the pairs plot, the observations are shown in red or black according to the cluster they belong to, as defined by the variables in the plot.
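The full pairs plot is dense, so zooming into a handful of variables makes the two clusters easier to read; a sketch (the variable selection is my own):
# a smaller pairs plot of a few variables, coloured by cluster
pairs(boston_scaled[, c("crim", "rad", "tax", "nox", "medv")], col = km$cluster)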
I didn’t do the bonus exercises.
Sina Hulkkonen 1/12/2018
I did all the DataCamp exercises for this week. For the first time I didn’t make it to the lecture, so I tried to manage with the exercises on my own.
library(GGally)
library(dplyr)
library(corrplot)
library(tidyr)
human<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header = TRUE)
#summary(human)
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human)%>%corrplot()
The ‘human’ dataset originates from the United Nations Development Programme. It contains information on education, birth and maternal mortality rates, life expectancy etc. in different countries. In the ggpairs plot, we see the distribution of each variable and how the variables correlate with each other. In the correlation plot, the size of the dot tells how strong the correlation is, red dots representing negative and blue dots positive correlations. For example, maternal mortality and life expectancy have a strong negative correlation.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
#showing the variance captured by the components
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
It looks like the component PC1 captures practically all the variance. This is likely because the dataset is not standardized and the variable variances differ tremendously, which leads PCA to treat the variable with the largest variance as the most important one (although it might not be). This is also why the plot is so ugly, I think.
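The raw variances confirm this suspicion; a quick check (my own addition) shows one variable dominating by orders of magnitude:
# variances of the unstandardized variables, largest first
sort(sapply(human, var), decreasing = TRUE)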
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
#showing the variance captured by the components
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Please see the interpretation of the PCA on the unstandardized data in the previous paragraph. Now the variance divides between the principal components much more evenly, with PC1 capturing 53.6% of the variance and the others less. The plot is also prettier. It shows that the arrows form three groups: maternal mortality and adolescent birth rate; the percentage of women representatives in parliament and the women:men labour force ratio; and the rest of the variables (life expectancy, education rate and ratio, and Gross National Income per capita). This suggests that the variables within each group might be related and measure a similar feature of the population or country.
It is logical that where more women are involved in working life, there are also more women as parliament representatives. Also, we know that women give birth younger in developing countries with poorer healthcare systems, and in countries where adolescents give birth, maternal mortality appears to be higher.
#install.packages("FactoMineR")
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.4
library(ggplot2)
#getting tea dataset
data("tea")
# look at the summaries and structure of the data
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#too many variables to handle, so I decided to make dataset tea_time
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage="quali")
# visualize MCA
plot(mca, invisible=c("var"), habillage="quali")
I decided to keep only some of the variables, since there were too many in the original dataset, and ran MCA on this subset. I plotted both the variable categories and the individuals. The first plot (variable categories) is informative and tells us which factor levels tend to occur together. For example, individuals who buy unpackaged tea are likely to buy it from a tea shop (those factor levels appear near each other in the plot).
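For completeness, the share of variance carried by each MCA dimension can be read from the model object; a sketch (my own addition):
# percentage of variance per dimension (column 2 of the eigenvalue table)
barplot(mca$eig[, 2], names.arg = rownames(mca$eig), las = 2, ylab = "% of variance")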
I’ve returned to my clinical work, so I didn’t make it to the lectures and had minimal time for these exercises. I did all the DataCamp exercises.
At first, I copied the code from the data wrangling exercise to get the long forms of these datasets:
library(tidyr)
library(dplyr)
library(ggplot2)
#loading the datasets BPRS and RATS
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
#both datasets include study IDs, and the participants have been divided into categories by treatment (BPRS) and group (RATS)
#converting categorical variables into factors
BPRS$treatment <- factor(BPRS$treatment)
RATS$Group <- factor(RATS$Group)
RATS$ID <- factor(RATS$ID)
##BPRS
#convert to long form
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
#extract the week number
BPRSL <- BPRSL %>% dplyr::mutate(week = as.integer(substr(weeks,5,5)))
##RATS
# Convert data to long form & add Time variable
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
The RATS data contains information on rats gaining weight on specific diets. The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.
library(ggplot2)
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
# Standardise the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(Weight_std = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
# Draw the plot with scaled weights
ggplot(RATSL, aes(x = Time, y = Weight_std, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight_std), max(RATSL$Weight_std)))
In the first picture we see how the weights of the rats in each group developed over time. In the second picture the weights are standardized, since the groups differ already at baseline; this makes it easier to compare the change they had over time.
# Number of time points in the RATS data
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
In this picture we see, for each group, the mean weight of the rats (whiskers represent the standard error) at different time points. It looks like the rats in all groups gained weight compared to the starting point, most of all in group 2.
# Create summary data by Group and ID with mean Weight as the summary variable
RATSLS <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
glimpse(RATSLS)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.72...
# Draw a boxplot of the mean versus treatment
ggplot(RATSLS, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Time > 0")
#Cannot do a t-test, since Group has three levels; let's try a chi-squared test instead
chisq.test(RATSLS$Group, RATSLS$mean)
## Warning in chisq.test(RATSLS$Group, RATSLS$mean): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: RATSLS$Group and RATSLS$mean
## X-squared = 32, df = 30, p-value = 0.3675
Here we see the mean weight in each group over the follow-up. It seems that the rats weighed more in groups 2 and 3. The chi-squared p-value of 0.3675 suggests no significant difference in mean weight between the groups, although a chi-squared test is a questionable choice for a continuous outcome (note the warning above).
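A one-way ANOVA would be the standard test for comparing a continuous mean across three groups; a sketch of what that would look like here (my own addition, not part of the exercise):
# one-way analysis of variance of the mean weights by Group
summary(aov(mean ~ Group, data = RATSLS))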
# Add the baseline from the original data as a new variable to the summary data
RATSLS <- RATSLS %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSLS)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Baseline measurements of the outcome variable in a longitudinal study are often correlated with the chosen summary measure, and using the baseline as a covariate in an analysis of covariance can lead to substantial gains in precision. Here we see that the baseline (the weight of the rats at the beginning of the study) is strongly associated with the mean weight during the follow-up. This means that if the groups differ by weight at baseline, the unadjusted results might be biased.
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 Min. : 1.00 Length:360 Min. :18.00 Min. :0
## 2:180 1st Qu.: 5.75 Class :character 1st Qu.:27.00 1st Qu.:2
## Median :10.50 Mode :character Median :35.00 Median :4
## Mean :10.50 Mean :37.66 Mean :4
## 3rd Qu.:15.25 3rd Qu.:43.00 3rd Qu.:6
## Max. :20.00 Max. :95.00 Max. :8
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
The BPRS data contains information on 40 male subjects who were randomly assigned to one of two treatment groups; each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
Please see the data wrangling part! I converted BPRS into longitudinal form previously.
#For this plot to succeed, I have to make new factors
BPRSL$treatment<-factor(BPRSL$treatment)
BPRSL$subject<-factor(BPRSL$subject)
# Plot the BPRSL data
#p<-ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
#p1<- p+geom_line(aes(linetype = treatment))
#p2<-p1+ scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
#p3<-p2+scale_y_continuous(name = "bprs (points)")
#p4<-p3+theme(legend.position = "top")
#help("geom_line")
#p4
p<-ggplot(BPRSL, aes(x = week, y = bprs, group = subject))
p1<- p+geom_line(aes(linetype = subject))
p2<-p1+ scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
p3<-p2+scale_y_continuous(name = "bprs (points)")
p4<-p3+theme(legend.position = "top")
p4
#ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +geom_line(aes(linetype = treatment)) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))+scale_y_continuous(name = "bprs (points)")+theme(legend.position = "top")
Here we ignore that multiple observations belong to the same individual (an independence model). My first attempt (commented out above) did not work as intended: subjects are numbered 1-20 within each treatment group, so grouping by subject alone mixes observations from different individuals. I was able to draw a plot where single individuals’ bprs scores are shown week by week.
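A likely fix (my own sketch, not from the exercise): since the subject numbering restarts in each treatment arm, grouping by the treatment-subject combination keeps each individual’s line separate:
# group by the treatment x subject combination so lines do not jump between individuals
ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject))) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "bprs (points)") +
  theme(legend.position = "top")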
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs~week + treatment, data=BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
help("pairs")
pairs(bprs~week, col=BPRSL$treatment, data=BPRSL)
Here we see that bprs points decrease by 2.27 when week increases by one (per week, in other words), with a significant p-value. On the other hand, bprs points seem to be 0.57 higher with treatment 2 than with treatment 1, but this finding is not statistically significant (p=0.661). The scatterplot shows how the bprs points were distributed in each treatment group at different time points. Note that this model still ignores the repeated measures within subjects.
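The natural next step would be a model that accounts for the repeated measures within subjects; a sketch of a random-intercept model (my own addition, assuming the lme4 package is available):
# random intercept per subject; fixed effects for week and treatment
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)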
I have to quit here. I ran into some difficulties here, and couldn’t find answers quick enough. I think this’ll do for me :)